In the United States, wage stagnation has become a hot-button issue across many fields of employment. Graduate students have been at the center of this issue in recent years: strikes for wage increases and cost-of-living adjustments have taken place at multiple universities throughout the country. Because PhD students often do not have the time to earn extra income (and their contracts often prohibit them from pursuing work elsewhere), how much they will earn from their stipend is a major factor in deciding where to pursue their research (Powell, 2004; Soar et al., 2022). My research question is: Is university ownership status (public vs. private) a predictor of the value of a PhD stipend?
Hypothesis
H₀: University ownership status is not a predictor of the value of a PhD stipend.
H₁: University ownership status is a predictor of the value of a PhD stipend.
Dataset
This dataset consists of self-reported survey data collected by PhDStipends.com. Respondents are asked for their university, department, academic year, and year in the program. They are also asked whether they receive a 12-month or 9-month salary, their gross pay, and their required fees. PhDStipends automatically calculates the LW Ratio (living wage ratio), which is the stipend divided by the living wage of the county in which the university is located.
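As a quick illustration of how the LW Ratio is defined, the sketch below divides a stipend by a living wage. Both dollar figures are hypothetical values chosen for the example, not numbers taken from the dataset.

```r
# Hypothetical values, for illustration only (not taken from the dataset)
stipend     <- 26000   # annual gross stipend in dollars
living_wage <- 25000   # annual living wage for the university's county

# LW Ratio = stipend / living wage; a value above 1 means the stipend
# exceeds the local living wage
lw_ratio <- stipend / living_wage
print(lw_ratio)
```

A ratio below 1 would indicate a stipend that falls short of the local living wage.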
In addition to this information, I manually categorized universities by their ownership status as public or private, and assigned each program to one of five broader academic disciplines: Business/Policy, Social Science, Natural Science, Formal Science, and Humanities. Due to a computer issue, much of my work was lost, so the dataset is currently incomplete. The analysis that follows is based on the information I was able to recover or reenter within a reasonable period of time.
The variables of interest for me are the ownership status, gross pay, program year, and academic discipline.
Code
library(readr)
csv <- read_csv("~/School/UMASS/DACSS 603/Final Project/csv.csv")
summary(csv)
Rows: 12160 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): University, Status, Department, Category, AcYear
dbl (7): Pay, LW Ratio, ProgYear, 12 M Gross Pay, 9 M Gross Pay, 3 M Gross P...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
University Status Department Category
Length:12160 Length:12160 Length:12160 Length:12160
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Pay LW Ratio AcYear ProgYear
Min. : 1 Min. :0.000 Length:12160 Min. :1.00
1st Qu.:20000 1st Qu.:0.880 Class :character 1st Qu.:1.00
Median :26000 Median :1.130 Mode :character Median :1.00
Mean :25765 Mean :1.095 Mean :2.05
3rd Qu.:31500 3rd Qu.:1.330 3rd Qu.:3.00
Max. :96000 Max. :4.120 Max. :6.00
NA's :47 NA's :422 NA's :1221
12 M Gross Pay 9 M Gross Pay 3 M Gross Pay Fees
Min. : 1 Min. : 15 Min. : 4 Min. : 1
1st Qu.: 24000 1st Qu.:16500 1st Qu.: 3000 1st Qu.: 500
Median : 29000 Median :20000 Median : 5000 Median : 1000
Mean : 28474 Mean :20128 Mean : 5194 Mean : 2030
3rd Qu.: 33000 3rd Qu.:24000 3rd Qu.: 6204 3rd Qu.: 2000
Max. :140000 Max. :87467 Max. :55816 Max. :93725
NA's :3632 NA's :8551 NA's :10951 NA's :7404
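Visualizations
I'll start with a histogram of all stipends, regardless of university ownership status.
Code
library(tidyverse)
viz <- csv %>% filter(Status %in% c("Public", "Private"))
hist(viz$Pay, breaks = 10)
The distribution appears somewhat normal, with annual pay most frequently in the range of $20,000 to $30,000.
Next I will generate 2 boxplots: one for public universities, and one for private.
Code
viz %>% ggplot(aes(x = Status, y = Pay, fill = Status)) + geom_boxplot()
There are quite a few outliers in both categories, but we can see that median pay is higher at private universities than at public universities. There are also noticeably more outliers below the 1st quartile among private universities than among public ones.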
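The comparison of medians and low outliers above can also be checked numerically. The sketch below is only an illustration under stated assumptions: it uses simulated pay values rather than the survey data, and it counts points below the lower whisker using the 1.5 × IQR rule that geom_boxplot() applies by default. The data frame `demo` and the helper `count_low_outliers()` are hypothetical names introduced for this example.

```r
# Simulated pay values standing in for the survey data (illustration only)
set.seed(4)
demo <- data.frame(
  Status = rep(c("Private", "Public"), each = 250),
  Pay    = c(rnorm(250, mean = 30000, sd = 8000),
             rnorm(250, mean = 23000, sd = 8000))
)

# Median pay by ownership status
medians <- tapply(demo$Pay, demo$Status, median)

# Count observations below the lower whisker (Q1 - 1.5 * IQR) in each group
count_low_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  sum(x < q[1] - 1.5 * (q[2] - q[1]), na.rm = TRUE)
}
low_outliers <- tapply(demo$Pay, demo$Status, count_low_outliers)

print(medians)
print(low_outliers)
```

Running the same two `tapply()` calls on the real `Pay` and `Status` columns would give counts that can be compared directly between the public and private groups.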
Hypothesis Testing
Explanatory Variable: Ownership Status (Status)
Response Variable: Gross Pay (Pay)
Control Variables: Academic Discipline (Category), Program Year (ProgYear)
First I will run a model for gross pay, using as.factor() to convert ownership status into dummy variables.
Code
fit1 <- lm(Pay ~ as.factor(Status), data = csv)
summary(fit1)
Call:
lm(formula = Pay ~ as.factor(Status), data = csv)
Residuals:
Min 1Q Median 3Q Max
-30328 -4729 668 4918 72671
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30332.2 130.8 231.81 <2e-16 ***
as.factor(Status)Public -7003.5 162.0 -43.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8494 on 12111 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.1336, Adjusted R-squared: 0.1336
F-statistic: 1868 on 1 and 12111 DF, p-value: < 2.2e-16
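Based on the p-value (< 2e-16), ownership status is a statistically significant predictor of pay: on average, reported pay at public universities is about $7,004 lower than at private universities. The R-squared of 0.13 indicates that ownership status alone explains only about 13% of the variation in pay. The diagnostic plots for this model are generated later, alongside those for the other models.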
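With a single two-level factor as the only predictor, the intercept is the mean of the reference group and the coefficient is the difference between group means. In the output above, that gives a mean of about 30332.2 for private universities and 30332.2 − 7003.5 = 23328.7 for public universities. The sketch below demonstrates this relationship on simulated data (not the survey data); `demo` and `fit_demo` are hypothetical names used only for this example.

```r
# Simulated stand-in for the survey data (values are made up for illustration)
set.seed(1)
demo <- data.frame(
  Status = rep(c("Private", "Public"), times = c(200, 300)),
  Pay    = c(rnorm(200, mean = 30000, sd = 8000),
             rnorm(300, mean = 23000, sd = 8000))
)

fit_demo <- lm(Pay ~ as.factor(Status), data = demo)

# Group means computed directly from the data
group_means <- tapply(demo$Pay, demo$Status, mean)

# "Private" sorts first, so it is the reference level:
# the intercept is the Private mean, and intercept + coefficient
# is the Public mean
coefs        <- unname(coef(fit_demo))
private_mean <- coefs[1]
public_mean  <- coefs[1] + coefs[2]

print(round(c(Private = private_mean, Public = public_mean)))
print(round(group_means))
```

The two printed vectors should agree, which is why the first model is equivalent to comparing the two group means.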
Next I will create a model adding the control variable “Category” (academic discipline).
Code
fit2 <- lm(Pay ~ as.factor(Status) + Category, data = csv)
summary(fit2)
Call:
lm(formula = Pay ~ as.factor(Status) + Category, data = csv)
Residuals:
Min 1Q Median 3Q Max
-32473 -4293 375 4701 72535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30191.0 176.8 170.757 < 2e-16 ***
as.factor(Status)Public -7118.8 159.0 -44.770 < 2e-16 ***
Category0 519.7 362.9 1.432 0.152228
CategoryBusiness/Policy 2089.8 592.7 3.526 0.000424 ***
CategoryFormal Science 393.0 250.6 1.568 0.116838
CategoryHumanities -3447.6 310.8 -11.093 < 2e-16 ***
CategoryNatural Science 2371.5 202.7 11.697 < 2e-16 ***
CategorySocial Science -1892.2 236.0 -8.019 1.16e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8310 on 12105 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.1713, Adjusted R-squared: 0.1708
F-statistic: 357.4 on 7 and 12105 DF, p-value: < 2.2e-16
Business/Policy, Humanities, Natural Science, and Social Science all appear to be statistically significant, while Formal Science does not (p = 0.117). However, “Category0” is likely skewing the results, as this includes degree programs I have yet to assign to a category. The R-squared value here (0.171) is higher than in the previous model (0.134); however, due to the incomplete data, I will take this with a grain of salt.
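One possible way to keep the unassigned programs from acting as their own category is to treat the placeholder as missing and set the reference level explicitly with relevel(). The sketch below is a minimal illustration on simulated data, assuming that unassigned programs are stored as the string "0" in `Category`; the data frame `demo`, the three discipline labels used, and the choice of "Natural Science" as the reference are assumptions made for the example.

```r
# Simulated data with a placeholder "0" category (illustration only)
set.seed(2)
n <- 400
demo <- data.frame(
  Status   = sample(c("Private", "Public"), n, replace = TRUE),
  Category = sample(c("0", "Humanities", "Natural Science", "Social Science"),
                    n, replace = TRUE),
  stringsAsFactors = FALSE
)
demo$Pay <- 25000 + 5000 * (demo$Status == "Private") + rnorm(n, sd = 6000)

# Treat the placeholder as missing instead of as a discipline
demo$Category[demo$Category == "0"] <- NA

# Convert to a factor and choose the reference level explicitly
demo$Category <- relevel(factor(demo$Category), ref = "Natural Science")

# lm() drops rows with a missing Category by default (na.action = na.omit)
fit_demo <- lm(Pay ~ Status + Category, data = demo)

print(levels(demo$Category))
print(names(coef(fit_demo)))
```

With this recoding, no "Category0" coefficient appears, and every discipline coefficient is read relative to the chosen reference level. The trade-off is that the uncategorized rows are excluded from the fit.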
Next I will create a model adding the control variable “ProgYear” (program year).
Code
fit3 <- lm(Pay ~ as.factor(Status) + ProgYear, data = csv)
summary(fit3)
Call:
lm(formula = Pay ~ as.factor(Status) + ProgYear, data = csv)
Residuals:
Min 1Q Median 3Q Max
-30564 -4621 469 4969 72741
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30719.46 177.92 172.659 <2e-16 ***
as.factor(Status)Public -7052.72 169.55 -41.597 <2e-16 ***
ProgYear -135.82 55.31 -2.456 0.0141 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8455 on 10896 degrees of freedom
(1261 observations deleted due to missingness)
Multiple R-squared: 0.1374, Adjusted R-squared: 0.1372
F-statistic: 867.5 on 2 and 10896 DF, p-value: < 2.2e-16
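Program year does appear to be statistically significant at the 5% level (p = 0.014), with each additional year in the program associated with about $136 lower pay. Note that 1,261 observations were dropped because of missing values. R-squared (0.137) is comparable to the original model (0.134).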
Finally, I will create a model using both control variables.
Code
fit4 <- lm(Pay ~ as.factor(Status) + Category + ProgYear, data = csv)
summary(fit4)
In this model, the disciplines of Business/Policy and Formal Science are the only ones which are not statistically significant.
Code
par(mfrow=c(2,3)); plot(fit1, which=1:6)
Code
par(mfrow=c(2,3)); plot(fit2, which=1:6)
Code
par(mfrow=c(2,3)); plot(fit3, which=1:6)
Code
par(mfrow=c(2,3)); plot(fit4, which=1:6)
The large number of categorical variables in my data makes plotting any model challenging, but from what I can see the fit is not great for any model. I am curious if a logit model would produce better results.
Summary
I need to reevaluate some of my variables and data and see if I can come up with a way to transform the data so that the models can be improved. I may experiment with relevel() and see if that has any effect. I also have yet to try an F-test.
Also, as previously mentioned, my data is incomplete; finishing the categorization of each degree program may improve my results.
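For the F-test mentioned above, one common approach is to compare nested models with anova(). The sketch below is a minimal example on simulated data, not the survey data. It assumes that both models must be fit to the same rows, so rows with a missing `ProgYear` are removed first; the names `demo`, `complete`, `small`, and `large` are hypothetical.

```r
# Simulated data with some missing program years (illustration only)
set.seed(3)
n <- 500
demo <- data.frame(
  Status   = sample(c("Private", "Public"), n, replace = TRUE),
  ProgYear = sample(c(1:6, NA), n, replace = TRUE)
)
demo$Pay <- 30000 - 7000 * (demo$Status == "Public") + rnorm(n, sd = 8000)

# Keep only complete rows so both models use the same observations
complete <- demo[complete.cases(demo), ]

small <- lm(Pay ~ Status, data = complete)             # restricted model
large <- lm(Pay ~ Status + ProgYear, data = complete)  # adds ProgYear

# F-test of whether adding ProgYear significantly reduces residual variance
f_test <- anova(small, large)
print(f_test)
```

Fitting both models to the same subset matters here because lm() silently drops rows with missing values, and anova() cannot compare models fit to different numbers of observations.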
References
Living Wage Calculator. (n.d.). Retrieved October 10, 2022, from https://livingwage.mit.edu/
Powell, K. Stipend survival. Nature 428, 102–103 (2004). https://doi.org/10.1038/nj6978-102a
Roberts, E., & Roberts, K. (2022, October 10). PhD stipends [Dataset]. http://www.phdstipends.com/csv
Soar, M., Stewart, L., Nissen, S. et al. Sweat Equity: Student Scholarships in Aotearoa New Zealand’s Universities. NZ J Educ Stud (2022). https://doi.org/10.1007/s40841-022-00244-5